Fix UTF-8 code units to match the number of bytes by vinistock · Pull Request #4098 · ruby/prism

vinistock · 2026-04-29T21:21:22Z

I was seeing some weird results for UTF-8 in the Ruby LSP and I realized we actually made a mistake when implementing the code unit offsets. The idea is for this method to return code units (not code points!). That means:

UTF-32: number of characters
UTF-16: number of 16-bit code units, where characters that take 1 to 3 bytes in UTF-8 count as 1 and characters that take 4 bytes in UTF-8 count as 2 (a surrogate pair)
UTF-8: number of bytes
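
To make the three counts concrete, here is a minimal sketch in plain Ruby that uses only core `String` methods. The `code_units` helper is a hypothetical name for illustration and is not part of Prism's API.

```ruby
# Hypothetical helper (not part of Prism): count the code units a string
# occupies under each of the three LSP position encodings.
def code_units(string, encoding)
  case encoding
  when Encoding::UTF_8
    # One code unit per byte.
    string.encode(Encoding::UTF_8).bytesize
  when Encoding::UTF_16LE
    # One code unit per 16 bits; characters outside the BMP take two.
    string.encode(Encoding::UTF_16LE).bytesize / 2
  when Encoding::UTF_32LE
    # One code unit per code point.
    string.encode(Encoding::UTF_32LE).bytesize / 4
  else
    raise ArgumentError, "unsupported encoding: #{encoding}"
  end
end

# Characters that take 1, 2, 3, and 4 bytes in UTF-8.
["a", "\u00E9", "\u20AC", "\u{1F600}"].each do |char|
  counts = [Encoding::UTF_8, Encoding::UTF_16LE, Encoding::UTF_32LE].map do |encoding|
    code_units(char, encoding)
  end
  puts "U+#{char.ord.to_s(16).upcase.rjust(4, "0")}: #{counts.inspect}"
end
# U+0061: [1, 1, 1]
# U+00E9: [2, 1, 1]
# U+20AC: [3, 1, 1]
# U+1F600: [4, 2, 1]
```

Only the 4-byte emoji gives a different count in each encoding, which is why it makes a good test case.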

Spec reference: see the comments above each position encoding kind.

We are actually not returning the number of bytes, but the string length (the number of characters), which is incorrect. It's easy to see the mistake in the tests: the UTF-8 length of an emoji, which takes 4 bytes, was previously expected to be 1 because it is a single character.

This PR makes sure that we're returning the number of bytes for UTF-8, which is the amount expected for code units.

Basically, I added a UTF-8 counter that just returns the number of bytes and tried to ensure naming consistency across the counters.
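
As a rough illustration of what the corrected behavior means for offsets, the sketch below converts a byte offset into a code unit offset. The `code_units_offset` function is a standalone, hypothetical helper written for this example. It is not Prism's implementation and assumes the byte offset falls on a character boundary.

```ruby
# Hypothetical standalone sketch (not Prism's actual code): convert a byte
# offset within a UTF-8 source string into a code unit offset.
# Assumes byte_offset lands on a character boundary.
def code_units_offset(source, byte_offset, encoding)
  prefix = source.byteslice(0, byte_offset)

  case encoding
  when Encoding::UTF_8
    # UTF-8 code units are bytes, so the offset is the prefix's byte size.
    prefix.bytesize
  when Encoding::UTF_16LE
    # Each UTF-16 code unit is 2 bytes wide.
    prefix.encode(Encoding::UTF_16LE).bytesize / 2
  when Encoding::UTF_32LE
    # One code unit per code point, which is the string length.
    prefix.length
  else
    raise ArgumentError, "unsupported encoding: #{encoding}"
  end
end

source = "\u{1F600} + 1"                 # a 4-byte emoji, then " + 1"
byte_offset = "\u{1F600} ".bytesize      # byte position of "+", which is 5

puts code_units_offset(source, byte_offset, Encoding::UTF_8)    # 5
puts code_units_offset(source, byte_offset, Encoding::UTF_16LE) # 3
puts code_units_offset(source, byte_offset, Encoding::UTF_32LE) # 2
```

Returning the string length for UTF-8 would have given 2 here instead of 5, which is the kind of mismatch described above.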

kddnewton · 2026-04-30T12:22:00Z

To get typecheck to pass you need to run BUNDLE_GEMFILE=gemfiles/typecheck/Gemfile bundle exec rake typecheck:rbs_inline typecheck:rbi

vinistock · 2026-04-30T13:28:33Z

Just FYI: running the command generated other RBI/RBS changes that aren't related to this PR, so I think the files might be out of sync in main despite not causing any type checking failures.

I only included the updates for parse result.

Earlopain

👍 looks good, thanks!

(ruby/prism#4098) ruby/prism@442bd904ed

vinistock self-assigned this Apr 29, 2026

vinistock requested a review from kddnewton April 29, 2026 21:27

vinistock force-pushed the vs_fix_utf_8_code_units branch 2 times, most recently from 8dd6d5b to 817ec5a Compare April 29, 2026 21:41

Earlopain reviewed Apr 30, 2026

Comment thread lib/prism/parse_result.rb Outdated

Fix UTF-8 code units to match the number of bytes

846dc81

vinistock force-pushed the vs_fix_utf_8_code_units branch from 817ec5a to 846dc81 Compare April 30, 2026 13:27

vinistock requested a review from Earlopain April 30, 2026 13:38

Earlopain approved these changes Apr 30, 2026

Earlopain merged commit 442bd90 into ruby:main Apr 30, 2026
61 checks passed

matzbot pushed a commit to ruby/ruby that referenced this pull request Apr 30, 2026

[ruby/prism] Fix UTF-8 code units to match the number of bytes

df62312

(ruby/prism#4098) ruby/prism@442bd904ed

vinistock deleted the vs_fix_utf_8_code_units branch April 30, 2026 14:27

vinistock mentioned this pull request Apr 30, 2026

Encapsulate location encoding in response builders Shopify/ruby-lsp#4089

Draft

Earlopain merged 1 commit into ruby:main from Shopify:vs_fix_utf_8_code_units
